Unsupervised Learning
src: https://www.youtube.com/watch?v=8dqdDEyzkFA
rough transcript (needs work):

Guys, I think I found it: classified tree of life.

Hello world, it's Siraj, and the most exciting class of machine learning techniques is called unsupervised learning. Teaching machines to learn for themselves, without having to be explicitly told if everything they do is right or wrong, is the key to true artificial intelligence and perhaps the most important research goal of... I mean, how else are we going to get to fully automated luxury gay space communism? In this episode I'm going to give you a broad overview of this area, as well as teach you two of the most popular unsupervised learning techniques, principal component analysis and k-means clustering, in order to save someone's life.

A patient at a hospital has been suffering from several epilepsy-related seizures. Luckily, we have a data set of their neural activity, recorded by electrodes that were inserted into their brain. The lead surgeon asked us to use unsupervised learning techniques on this neural data to find out what part of their brain is causing the seizures so they can perform surgery on it. Will we save the patient's life? We'll find out at the end of this video, and subscribe if you want to keep learning about AI technology for free.

We can divide machine learning into two types: supervised and unsupervised. There's also reinforcement learning, but that only applies in a real-time environment; it's not a static data spreadsheet. There's also quantum machine... Look, can you please keep it simple for once?

So, supervised learning is synonymous with pattern matching. It's done using the ground truth, meaning we have prior knowledge of what the output values for our input data should be. You know that "hot dog, not hot dog" classifier trope from the popular show Silicon Valley? That's supervised learning. My life is literally that show, so I don't watch it. The goal is to approximate the relationship between input and output data. Most machine learning across
every industry is done this way. It's easy, it's straightforward, and it tends to perform very well if given enough samples. But clean, perfectly labeled datasets aren't always easy to find. In fact, [figure missing]% of the world's data is unstructured.

The goal of unsupervised learning is to automatically find structure in a data set. This can itself be the goal (discovering hidden patterns in data) or a means to an end (learning what the most relevant features are). We can further subdivide unsupervised learning into different types of techniques.

Clustering finds data points similar to each other and groups them together. If we had any kind of population data, whether we were a government organization or a start-up with a product like diet water (yes, that's real), basically anyone trying to reach a certain set of people, we'd want to segment that population into smaller clusters with similar demographics and purchasing habits so that we could target them most effectively when spending our marketing budget. Anomaly detection finds the outliers in a collection of data points; banks use it to find fraudulent transactions. Association finds correlated features between data points, then lets us infer other features of a given data point; Airbnb uses it to recommend other listings you'd probably like. And dimensionality reduction reduces the number of features in a data set, which makes it easier to visualize and interpret.

Yann LeCun, director of AI research at Facebook, puts it best with his quote: "If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We now know how to make the icing and the cherry, but we don't know how to make the cake." Talk about strange but weirdly effective metaphors. Here's to you, Yann.

So let's take a look at our data to decide what to do with it. This is a [number missing]-minute-long recording of neural data from an epilepsy patient. A set of electrodes was inserted into the brain of this patient
to record the activity of neurons in real time. It picked up electrical spikes of neurons, and we can see several features here that relate to the recording device's measurements, like the channel number, frequency, and the number of samples. Let's first visualize this data using digital alchemy, a.k.a. Python.

We want to extract spikes from the signal, and to do that we'll find data points in the signal that are above some predefined threshold and align them at their peak amplitude. We can do this with just a random sample of spikes and see that there are at least two types of waveforms in the data: one group of spikes with a sharp, high-amplitude peak and a second group with a broader initial peak. These spikes were likely generated by more than one neuron. If we can find a way to group these waveforms into different clusters, it will help us figure out which spike corresponds to which neuron, which will help surgeons decide where to perform surgery.

But in order to cluster the waveforms, we're going to need to decide which features to input to our algorithm. One possible feature could be, for example, the peak amplitude of the spike or the width of the waveform. But not all features are equally informative and useful. We need to select the features that best represent the spike wave shapes and get rid of the rest for our prediction to be accurate. The way we're going to do that is to use a type of unsupervised learning called dimensionality reduction, of which there are several techniques, like brute force. No, we're going to use a popular one called principal component analysis, or PCA.

PCA finds the principal components of a dataset. Principal components are the underlying structure in the data: they are the directions where there is the most variance, meaning where the data is most spread out. It's useful to measure data in terms of principal components rather than on a normal x-y axis. Imagine that we had a bunch of data points, which we'll denote as Triforce symbols as an ode to the princess. To find the
direction with the most variance, we can find the straight line where the data is most spread out when projected onto it. A vertical straight line with the points projected onto it will look kind of like this: not very spread out, so there's a small variance, likely no principal component here. A horizontal line, however, with the points projected onto it looks way more spread out: a high variance. There's no straight line we can draw that has a larger variance than the horizontal one. Thus the horizontal line is the principal component in this example.

To find principal components we use linear algebra, one of the mathematical pillars of machine learning. Two concepts here: eigenvectors, which have a direction, and eigenvalues, which are numbers that tell us how much variance there is in the data in that direction. These two concepts come in pairs, like yin and yang, and the eigenvector with the highest eigenvalue is the principal component. In a three-dimensional data set there are three variables. Imagine all the data points lie on a piece-of-paper-sized plane in this 3D graph. When we find the three eigenvectors and eigenvalues, two will have large eigenvalues and one of the eigenvectors will have an eigenvalue of zero. If we rearrange our axes to be along the eigenvectors rather than the original variables, discarding the third one, we essentially get rid of the useless direction and are able to represent the data in two dimensions. We can do this in a single line thanks to scikit-learn; we just need to specify how many components we want.

When I find myself with features, Mother ML comes to me, predicting just the best ones, let it be.

Once we've reduced the dimensionality of our data, we're ready to perform clustering, the second type of unsupervised learning. A popular clustering technique is called k-means. First, we choose K random data points from our sample. These represent the cluster centers, and their number equals the number of clusters. Then we calculate the distance between all the
random cluster centers and every other data point. We then assign each data point to the cluster center closest to it. Since we started with random data points, it won't give us a great result, so we repeat the process, and instead of using random data points as cluster centers we calculate the actual cluster centers based on the previous random assignment. This just keeps repeating, and with every iteration the number of data points that switch clusters goes down, until we arrive at an optimum (a local one, since the result depends on the random starting points). We're now in the coochie gang, a newer version of the hoochie game.

A question arises, though: how do we choose the number of clusters? We could try running k-means multiple times with different cluster numbers. When we plot the result, we can analyze it to see if we chose too many clusters, too few, or just the right amount. Based on our domain knowledge, we can expect to find more than two or three separable clusters from a single electrode recording, and our plot seems to confirm this notion.

Another way to decide this is to use the elbow method. The way that this works is to run k-means several times, increasing the number of clusters every run, and during every run we calculate the average distance of each data point to its cluster center. As the number of clusters increases, the average intra-cluster distance decreases. When we reach six clusters, the average distance to the cluster center no longer changes much, and this is called the elbow point. It gives us a recommendation of how many clusters we should use.

By clustering the data, we're able to sort the neuron spikes into distinct regions, which correlate to different parts of the brain. This is going to be supremely helpful for our client at the hospital, and we just used data science to save a patient's life.

Before we pop champagne, there are three things to remember from this video. Unsupervised learning helps find previously unknown patterns in a data set without needing a label. Principal component analysis is a dimensionality reduction technique that
helps find the most relevant features in a data set. And k-means clustering is the most popular clustering technique, grouping similar data points together for further analysis.

What is your next data science project? Let me know in the comment section, and please subscribe for more programming videos. For now, I've got to find myself, so thanks for watching.
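
Sketch (not from the video): the PCA, k-means, and elbow steps described in the transcript can be tried out roughly as follows. The video says it uses scikit-learn for PCA; this sketch instead uses plain NumPy so that the eigenvector and cluster-center steps are visible. The waveforms are synthetic stand-ins for the two spike shapes mentioned (one sharp, one broad), not the patient recording, and the names `pca`, `kmeans`, and `best_kmeans` are hypothetical helpers. Because k-means can settle in a poor local optimum depending on its random start, the sketch keeps the best of a few restarts.

```python
import numpy as np


def pca(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    Xc = X - X.mean(axis=0)                    # center every feature
    cov = np.cov(Xc, rowvar=False)             # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    return Xc @ eigvecs[:, order], eigvals[order]


def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means. Returns labels, centers, mean distance to own center."""
    rng = np.random.default_rng(seed)
    # Start from k distinct random data points, as described in the transcript.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)          # assign each point to nearest center
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)                  # keep the old center if a cluster empties
        ])
        if np.allclose(new_centers, centers):  # nothing moved: converged
            break
        centers = new_centers
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    mean_dist = dists[np.arange(len(X)), labels].mean()
    return labels, centers, mean_dist


def best_kmeans(X, k, n_restarts=5):
    """Run k-means from several random starts and keep the tightest result."""
    runs = [kmeans(X, k, seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda run: run[2])


# Synthetic spikes (assumption): 100 sharp high-amplitude and 100 broad waveforms.
rng = np.random.default_rng(42)
t = np.arange(40)
sharp = 2.0 * np.exp(-0.5 * ((t - 15) / 1.5) ** 2)
broad = 1.0 * np.exp(-0.5 * ((t - 15) / 5.0) ** 2)
waveforms = np.vstack([
    sharp + 0.05 * rng.standard_normal((100, 40)),
    broad + 0.05 * rng.standard_normal((100, 40)),
])

# Reduce 40 samples per spike to 2 features, then cluster in that space.
features, variances = pca(waveforms, n_components=2)
labels, centers, _ = best_kmeans(features, k=2)

# Elbow method: mean distance to the assigned center for k = 1..6.
elbow = [best_kmeans(features, k)[2] for k in range(1, 7)]
print("variance along each component:", variances)
print("mean distance per k:", elbow)
```

On data like this the mean distance should drop sharply from one cluster to two and flatten afterwards; the point where it flattens is what the transcript calls the elbow. On real recordings the number of clusters and the number of components to keep would need to be checked against the data.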